options(repos = list(CRAN="http://cran.rstudio.com/"))
# Load packages here
library("ggthemes")
library("tidyverse")
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library("ggplot2")
library("plyr")
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
##
## Attaching package: 'plyr'
##
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
##
## The following object is masked from 'package:purrr':
##
## compact
library("dplyr")
library("plotly")
##
## Attaching package: 'plotly'
##
## The following objects are masked from 'package:plyr':
##
## arrange, mutate, rename, summarise
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
library("gganimate")
library("scales")
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
In this exercise, the goal is to create one of the most famous plots in chaos theory. The equation of the logistic map is very simple, but its behaviour is stunningly complex:
\[ x_{n+1} = rx_{n}(1-x_{n}) \]
Starting with an initial value of \(x_{0}\) between zero and one, e.g. 0.5, and setting a constant value of r, e.g. between zero and four, the equation is iterated forward and thereby computes \(x_{1}, x_{2}\), etc. We will only care about visualisation here, but if you are interested in learning more about the background of the equation and plot, have a look at this or this video.
The goal is to create a plot with the different values of r on the x-axis and the x values corresponding to each r value on the y-axis. In parts of the plot, all these x values collapse onto a single point, while for other values of r, x fluctuates perpetually.
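To build some intuition before the full dataset, here is a minimal sketch of the iteration itself in base R (the function name is my own, not part of the assignment):

```r
# Iterate x_{n+1} = r * x_n * (1 - x_n) and return the final value
iterate_map <- function(r, x0 = 0.5, n = 1000) {
  x <- x0
  for (i in seq_len(n)) {
    x <- r * x * (1 - x)
  }
  x
}

# For r < 3 the series converges to the fixed point x* = 1 - 1/r
iterate_map(2.5)  # 0.6
```

For r between 3 and roughly 3.45 the series instead settles into a period-2 cycle, and beyond that it doubles again and eventually becomes chaotic, which is exactly the structure the bifurcation plot makes visible.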
The following code chunk computes the main dataset of the plot for you. You are welcome to study the code, but this is not part of the assignment and you do not have to worry about how exactly it works (this is not a course about chaos theory, after all). The data contained in logistic_map_data is already in a tidy format: one variable denotes the value of r, the other the associated x values. Each value of r is repeated over \(n=1000\) rows, giving \(n\) associated x values (these can be constant or fluctuating, depending on the value of r). Only some information for the colour still has to be added.
# x observations for each r value
n <- 1000
# Step between each r value
r_step <- 0.001
r_range <- seq(2.5, 4, by = r_step)
# Number of observations discarded before the n which are stored
to_discard <- 500
logistic_map_data <- matrix(0, nrow = n * length(r_range), ncol = 2)
for (r in r_range) {
  current_logistic_map_series <- numeric(n + to_discard)
  current_logistic_map_series[1] <- 0.5
  for (k in 1:(n + to_discard - 1)) {
    current_logistic_map_series[k + 1] <-
      r * current_logistic_map_series[k] * (1 - current_logistic_map_series[k])
  }
  start_index <- 1 + n * (match(r, r_range) - 1)
  end_index <- n * match(r, r_range)
  logistic_map_data[start_index:end_index, 1] <- r
  logistic_map_data[start_index:end_index, 2] <- tail(current_logistic_map_series, n)
}
logistic_map_data <- as_tibble(data.frame(logistic_map_data))
colnames(logistic_map_data) <- c("r", "x")
Hint: Create your final dataset with n <- 1000 and r_step <- 0.001; however, for these values it takes R some time to compute the plot. When building your plot and adjusting axes, colours, etc., one approach is to first use e.g. n <- 10 and r_step <- 0.01 until you have a version of the plot that you are happy with. Just note that the opacity parameter will have to be decreased again once you have increased n, because there are then more points in the plot.
# Your code here
head(logistic_map_data)
## # A tibble: 6 × 2
## r x
## <dbl> <dbl>
## 1 2.5 0.6
## 2 2.5 0.6
## 3 2.5 0.6
## 4 2.5 0.6
## 5 2.5 0.6
## 6 2.5 0.6
# Creating a point plot of x against r
pBW <- ggplot(logistic_map_data, aes(x = r, y = x, colour = r)) +
  geom_point(size = 0.01, alpha = 0.01)
# Removing the y-axis label, graph background and legend
pStripped <- pBW +
  theme(panel.background = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.title.y = element_blank(),
        legend.position = "none")
# Adding colour to the graph
pRainbow <- pStripped +
  binned_scale(scale_name = "stepsn",
               palette = function(x) c("#FF6161", "#B7B730", "#25BF25",
                                       "#91DEDE", "#CAD4FF", "#FFB3F3"),
               breaks = c(3.5, 3.6, 3.7, 3.8, 3.9),
               aesthetics = "color")
pRainbow
In this exercise, try to replicate the following figure, which displays the average popularity metrics of legislators grouped by gender and party. Note that this example first involves some reshaping of the data, which you can do with tidyr and dplyr from the tidyverse.
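As a toy sketch of the wide-to-long reshaping involved (using tidyr and dplyr; the column names here are made up for illustration, not taken from the Facebook data):

```r
library(tidyr)
library(dplyr)

toy <- tibble(gender       = c("F", "M"),
              party        = c("Democrat", "Republican"),
              likes_count  = c(10, 20),
              shares_count = c(1, 2))

long <- toy %>%
  unite("GP", gender:party) %>%                 # e.g. "F_Democrat"
  pivot_longer(cols = ends_with("_count"),
               names_to = "measure", values_to = "value")
long  # 4 rows: one per group-measure combination
```

pivot_longer() is the modern replacement for gather(); both produce the same long format here.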
# Data for the plot
fb <- read.csv("data/fb-congress-data.csv", stringsAsFactors=FALSE)
# Your code here
table(fb$party)
##
## Democrat Independent Republican
## 4934 47 5019
# Keeping only the relevant columns and the two major parties
fb <- fb[, c(5, 6, 7, 8, 9, 10, 11, 12, 13, 15)]
fb <- fb[fb$party == "Democrat" | fb$party == "Republican", ]
table(fb$party)
##
## Democrat Republican
## 4934 5019
colnames(fb)
## [1] "likes_count" "comments_count" "shares_count" "love_count"
## [5] "haha_count" "wow_count" "angry_count" "sad_count"
## [9] "gender" "party"
# Joining gender and party into one variable
fb <- fb %>%
  unite("GP", gender:party)
# Gathering the count columns into long format
new_fb <- gather(fb, key = "measure", value = "value",
                 c("likes_count", "comments_count", "shares_count", "love_count",
                   "haha_count", "wow_count", "angry_count", "sad_count"))
# Recoding the group labels
new_fb$GP <- gsub("M_Democrat", "D-M", new_fb$GP)
new_fb$GP <- gsub("F_Democrat", "D-F", new_fb$GP)
new_fb$GP <- gsub("M_Republican", "R-M", new_fb$GP)
new_fb$GP <- gsub("F_Republican", "R-F", new_fb$GP)
new_fb <- new_fb %>%
  mutate(value = replace_na(value, 0))
# Plotting a bar chart of group means
# Note: theme_minimal() must come before theme(legend.position = "none"),
# otherwise it resets the legend setting
ggplot(new_fb, aes(x = GP, y = value, fill = GP)) +
  geom_bar(stat = "summary", fun = "mean") +
  facet_wrap(~measure, scales = "free", nrow = 2) +
  labs(title = "Partisan asymmetries by gender in Facebook popularity metrics",
       subtitle = "Female Democrats receive more engagement than male Democrats. The opposite is true for Republicans",
       x = "Party and gender of member of Congress",
       y = "Average of each type of social metric") +
  scale_y_continuous(labels = comma) +
  scale_fill_manual(values = c("dark blue", "blue", "dark red", "red")) +
  theme_minimal() +
  theme(legend.position = "none")
For this exercise, try to replicate the plot below, which Pablo Barbera prepared for a Washington Post blog post a few years ago.
The plot combines two sources of data: the ideology estimates for each actor (available in ideology_1.csv) and a random sample of ideology estimates for the three density plots (in ideology_2.csv).
As a clue, Pablo used theme_tufte from the ggthemes package as the main theme (which he then edited manually). But there may be other ways of replicating it.
# Data for main plot
ideology <- read.csv("data/ideology_1.csv")
view(ideology)
# Data for background plots
bg <- read.csv("data/ideology_2.csv")
head(bg)
## ideology type
## 1 0.9740434 Republican
## 2 0.6166325 Republican
## 3 0.3337557 Republican
## 4 0.9350899 Republican
## 5 0.6329505 Republican
## 6 0.2592620 Republican
# Your code here
# Plotting the main plot
# Generate a row index used as the vertical position of each actor
ideology <- dplyr::mutate(ideology, id = row_number() / 20)
head(ideology)
## screen_name twscore type party twscore.sd id
## 1 @tedcruz 0.8946006 Primary Candidate Republican 0.10787669 0.05
## 2 @RealBenCarson 0.8941568 Primary Candidate Republican 0.11932281 0.10
## 3 @ScottWalker 0.8771619 Primary Candidate Republican 0.08211766 0.15
## 4 @RandPaul 0.8285642 Primary Candidate Republican 0.09440381 0.20
## 5 @rushlimbaugh 0.7938614 Media Outlets Z 0.10337143 0.25
## 6 @BobbyJindal 0.7871339 Primary Candidate Republican 0.07293965 0.30
# Baseline
i <- ggplot(ideology,
            mapping = aes(x = twscore, y = id,
                          xmin = twscore - twscore.sd, xmax = twscore + twscore.sd,
                          colour = party, label = screen_name)) +
  geom_pointrange(size = 0.1) +
  geom_text(size = 2, hjust = 1, nudge_x = -0.15)
# Adding colour
cols <- c("Republican" = "red", "Z" = "black", "Democrat" = "blue")
iCol <- i + scale_color_manual(name = "Party", values = cols)
# Adding the theme and removing the y-axis elements and legend
iStripped <- iCol + theme_tufte() +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.title.y = element_blank(),
        legend.position = "none") +
  labs(x = "Position on latent ideological scale",
       subtitle = "Twitter ideology scores of potential Democratic and Republican presidential primary candidates")
iStripped
# Plotting the background plots
# Baseline
l <- ggplot(bg, aes(x = ideology, fill = type)) +
  geom_density(alpha = 0.2, color = NA) +
  scale_x_continuous(limits = c(-2.5, 2.5))
# Adding colour and theme
lCol <- l +
  scale_color_manual(name = "Party", values = cols, aesthetics = "fill") +
  theme_tufte()
# Calculating the group averages (and dropping the "Z" group in row 3)
avg <- ddply(bg, "type", summarise, grp.mean = mean(ideology))
avg <- avg[-3, ]
# Adding intercepts
lLines <- lCol +
  geom_vline(xintercept = 0, size = 0.1, color = "black") +
  geom_vline(data = avg, aes(xintercept = grp.mean, color = type), size = 0.1) +
  scale_color_manual(values = c("blue", "red", "black"))
# Adding labels to the intercepts
lLabel <- lLines +
  annotate("text", x = 0.1, y = 1, label = "Average Twitter Score",
           size = 2.5, angle = 90) +
  annotate("text", x = -1.1, y = 0.5, label = "Average Democrat\n in 114th congress",
           size = 2.5, angle = 90) +
  annotate("text", x = 0.7, y = 1.2, label = "Average Republican\n in 114th congress",
           size = 2.5, angle = 90)
# Adding the theme and removing the y-axis elements and legend
lStripped <- lLabel + theme_tufte() +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.title.y = element_blank(),
        legend.position = "none") +
  labs(x = "Position on latent ideological scale",
       subtitle = "Twitter ideology scores of potential Democratic and Republican presidential primary candidates")
lStripped
## Warning: Removed 392 rows containing non-finite values (stat_density).
I attempted to combine both graphs into one plot; unfortunately, I was not able to fix the sizing and proportions. Code for plotting both on the same scale below:
# Merging the plots together
lfinal <- lStripped +
  geom_pointrange(data = ideology,
                  mapping = aes(x = twscore, y = id,
                                xmin = twscore - twscore.sd,
                                xmax = twscore + twscore.sd,
                                colour = party))
# hjust and nudge_x are geom parameters, not aesthetics, so they go outside aes()
lfinal2 <- lfinal +
  geom_text(data = ideology,
            mapping = aes(x = twscore, y = id, colour = party, label = screen_name),
            size = 2, hjust = 1, nudge_x = -0.15)
lfinal2
## Warning: Removed 392 rows containing non-finite values (stat_density).
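One possible way around the sizing problem (my own suggestion, not part of the original assignment) is to keep the two plots separate and stack them with the patchwork package, which lets you control the relative panel heights. A minimal sketch with stand-in plots built from mtcars:

```r
library(ggplot2)
library(patchwork)

# Two stand-in plots; in the exercise these would be the density plot
# on top and the pointrange plot below
p_top    <- ggplot(mtcars, aes(wt)) + geom_density()
p_bottom <- ggplot(mtcars, aes(wt, mpg)) + geom_point()

# "/" stacks the plots; plot_layout() sets their relative heights
combined <- p_top / p_bottom + plot_layout(heights = c(1, 3))
combined
```

Because both panels share the same x variable, the density plot then sits above the pointrange plot on a common horizontal scale.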
In this exercise you can visualise data about a topic you are interested in.
First, download the data. If you are looking for ideas, have a look at health data from the World Health Organization, economic data from the St. Louis Federal Reserve, or data on wealth and income inequality from the World Inequality Database. Once you have downloaded the data, load it into R and process it, and then explore and illustrate it with plots created with ggplot2 and/or plotly. You can also add brief explanations through markdown text.
More extensive, carefully thought out, polished, and well understandable answers will receive more points.
Note: Some of these data can also be obtained via APIs, but you can just manually download files such as .csv for this assignment. This has no effect on the grade.
For this exercise, I am using an open-source dataset which contains ratings of movies released from 2007 to 2011.
# Your code here
dataset <- read.csv("data/Movie_Ratings_data.csv")
head(dataset)
## Film Genre Rotten.Tomatoes.Ratings.. Audience.Ratings..
## 1 (500) Days of Summer Comedy 87 81
## 2 10,000 B.C. Adventure 9 44
## 3 12 Rounds Action 30 52
## 4 127 Hours Adventure 93 84
## 5 17 Again Comedy 55 70
## 6 2012 Action 39 63
## Budget..million... Year.of.release
## 1 8 2009
## 2 105 2008
## 3 20 2009
## 4 18 2010
## 5 20 2009
## 6 200 2009
colnames(dataset) <- c("Film", "Genre", "CriticalRating", "AudienceRating", "BudgetMillions", "Year")
#Analysing tail of dataset
tail(dataset)
## Film Genre CriticalRating AudienceRating
## 557 Your Highness Comedy 26 36
## 558 Youth in Revolt Comedy 68 52
## 559 Zack and Miri Make a Porno Romance 64 70
## 560 Zodiac Thriller 89 73
## 561 Zombieland Action 90 87
## 562 Zookeeper Comedy 14 42
## BudgetMillions Year
## 557 50 2011
## 558 18 2009
## 559 24 2008
## 560 65 2007
## 561 24 2009
## 562 80 2011
#Analysing structure of dataset
str(dataset)
## 'data.frame': 562 obs. of 6 variables:
## $ Film : chr "(500) Days of Summer " "10,000 B.C." "12 Rounds " "127 Hours" ...
## $ Genre : chr "Comedy" "Adventure" "Action" "Adventure" ...
## $ CriticalRating: int 87 9 30 93 55 39 40 50 43 93 ...
## $ AudienceRating: int 81 44 52 84 70 63 71 57 48 93 ...
## $ BudgetMillions: int 8 105 20 18 20 200 30 32 28 8 ...
## $ Year : int 2009 2008 2009 2010 2009 2009 2008 2007 2011 2011 ...
#Getting stats for the dataset
summary(dataset)
## Film Genre CriticalRating AudienceRating
## Length:562 Length:562 Min. : 0.0 Min. : 0.00
## Class :character Class :character 1st Qu.:25.0 1st Qu.:47.00
## Mode :character Mode :character Median :46.0 Median :58.00
## Mean :47.4 Mean :58.83
## 3rd Qu.:70.0 3rd Qu.:72.00
## Max. :97.0 Max. :96.00
## BudgetMillions Year
## Min. : 0.0 Min. :2007
## 1st Qu.: 20.0 1st Qu.:2008
## Median : 35.0 Median :2009
## Mean : 50.1 Mean :2009
## 3rd Qu.: 65.0 3rd Qu.:2010
## Max. :300.0 Max. :2011
#Getting all column names
colnames(dataset)
## [1] "Film" "Genre" "CriticalRating" "AudienceRating"
## [5] "BudgetMillions" "Year"
# Calculating the average ratings for each genre
avgGenre <- ddply(dataset, "Genre", summarise,
                  Audience.mean = mean(AudienceRating),
                  Critical.mean = mean(CriticalRating))
avgGenre
## Genre Audience.mean Critical.mean
## 1 Action 58.72078 44.40260
## 2 Adventure 62.72414 53.10345
## 3 Comedy 56.40698 44.91860
## 4 Drama 64.42574 56.47525
## 5 Horror 47.38776 34.57143
## 6 Romance 62.33333 48.19048
## 7 Thriller 65.58333 59.08333
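The same per-genre averages can also be computed with dplyr's group_by()/summarise(), which sidesteps the plyr/dplyr masking issues flagged when the packages were loaded (with plyr attached after dplyr, summarise() resolves to plyr, so the dplyr version must be namespaced). A sketch with toy data, not the movie dataset:

```r
library(dplyr)

toy <- data.frame(Genre          = c("Action", "Action", "Drama"),
                  AudienceRating = c(50, 70, 80),
                  CriticalRating = c(40, 60, 90))

# dplyr::summarise is spelled out explicitly to avoid plyr's masking
avg_toy <- toy %>%
  group_by(Genre) %>%
  dplyr::summarise(Audience.mean = mean(AudienceRating),
                   Critical.mean = mean(CriticalRating))
avg_toy
```

This returns one row per genre with the two mean columns, just like the ddply() call above.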
Using plotly, the visualisation represents the whole dataset. The critical rating can be compared to the audience rating; point size represents the budget, and each genre is given a specific colour.
# Plotting critical rating against audience rating;
# size represents budget and colour represents genre
movies <- ggplot(dataset, aes(x = CriticalRating, y = AudienceRating,
                              colour = Genre, size = BudgetMillions, label = Film)) +
  geom_point(alpha = 0.5)
# Bigger bubbles represent movies with a bigger budget
IMovies <- ggplotly(movies)
IMovies
For further analysis, and to make it more engaging, the animated visualisation below shows how ratings changed across the years 2007 to 2011 compared to the average for each genre. The gganimate library was used to achieve this.
From the animated visualisation, it is easy to compare the number of movies released across the years. It is also apparent that action movies have the most releases and the highest-budget films, while the romance genre has the fewest releases.
With the line intercepts present in each panel, it is also easy to see how each film's rating compares to the genre average. As seen, the thriller and drama genres have the highest audience and critical ratings.
Adapted from thomasp85's gganimate (version 1.0.8.9) and the R Graph Gallery. https://github.com/thomasp85/gganimate
movies <- ggplot(dataset, aes(CriticalRating, AudienceRating,
                              size = BudgetMillions, colour = Genre, label = Film)) +
  geom_point(alpha = 0.7, show.legend = FALSE) +
  # Adding average intercepts
  geom_vline(data = avgGenre, aes(xintercept = Critical.mean, color = Genre),
             size = 0.4) +
  geom_hline(data = avgGenre, aes(yintercept = Audience.mean, color = Genre),
             size = 0.4) +
  # Adding annotations
  annotate("text", x = 50, y = 15, label = "Average Critical Rating", size = 3,
           angle = 90, color = "grey") +
  annotate("text", x = 1, y = 58, label = "Average Audience Rating", size = 3,
           angle = 0, color = "grey") +
  # Size and scale range
  scale_size(range = c(1, 13)) +
  scale_x_log10(oob = scales::squish_infinite) +
  facet_wrap(~Genre, nrow = 2) +
  # Adding labels and titles
  labs(title = "Year: {frame_time}", x = "Critical Rating",
       y = "Audience Rating",
       subtitle = "Ratings of different genre movies from 2007 to 2011")
# Animating using gganimate
Anim_movies <- movies + transition_time(Year) +
  ease_aes("linear") +
  theme_few()
# Animate
animate(Anim_movies, width = 1200, height = 1000)
## Warning: Transformation introduced infinite values in continuous x-axis